A paper by Alsinglawi et al. was recently accepted and published in Scientific Reports. In this paper, the authors aim to predict the length of stay (LOS) of lung cancer patients in an ICU department, discretized into either long (> 7 days) or short (< 7 days) stays, using various machine learning techniques. The authors claim to achieve perfect results, with an Area Under the Receiver Operating Characteristic curve (AUROC) of 100% using a Random Forest (RF) classifier combined with the ADASYN class-balancing oversampling technique. If accurate, this could have significant implications for hospital management. However, we have identified several methodological flaws in the manuscript that make the results overly optimistic and that would have serious consequences if the model were used in clinical practice. Moreover, the reporting of the methodology is unclear and many important details are missing from the manuscript, which makes reproduction extremely difficult. We highlight the effect these oversights have had on the results and provide a more believable result of 88.91% AUROC when they are corrected.
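Feature selection is a crucial step in developing robust and powerful machine learning models. Feature selection techniques can be divided into two categories: filter and wrapper methods. Although wrapper methods commonly yield strong predictive performance, they have a large computational complexity and therefore take a significant amount of time to complete, especially when dealing with high-dimensional feature sets. Filter methods, by contrast, are considerably faster but suffer from several other disadvantages, such as (i) requiring a threshold value, (ii) not taking into account the intercorrelation between features, and (iii) ignoring feature interactions with the model. To this end, we present PowerShap, a novel wrapper feature selection method that combines statistical hypothesis testing and power calculations with Shapley values for fast and intuitive feature selection. PowerShap is built on a core assumption: an informative feature will have a larger impact on the prediction than a known random feature. Benchmarks and simulations show that PowerShap outperforms other filter methods and matches the predictive performance of wrapper methods while being significantly faster, in some cases taking only half or even a third of the execution time. PowerShap therefore provides a competitive and fast algorithm that can be used with various models in different domains. Furthermore, PowerShap is implemented as a plug-and-play, open-source sklearn component, which allows easy integration into conventional data science pipelines. User experience is further enhanced by an automatic mode that tunes the hyperparameters of the PowerShap algorithm, so that the algorithm can be used without any configuration.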
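The core assumption, that an informative feature should outweigh a known random feature, can be illustrated with a minimal sketch. This is not the PowerShap implementation or its API: the function names `linear_shap_importance` and `select_features`, the uniform random feature, the 80/20 split, the iteration count, and the empirical p-value are all assumptions made for illustration, and the power calculation mentioned in the abstract is omitted. A plain least-squares model is used because its Shapley values have a closed form when features are treated as independent.

```python
import numpy as np

def linear_shap_importance(X_train, y_train, X_val):
    """Mean |Shapley value| per feature for an ordinary least-squares model."""
    # Fit least squares with an intercept column appended.
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    w = coef[:-1]
    # For a linear model with features treated as independent, the Shapley
    # value of feature j on sample x is w_j * (x_j - mean(x_j)).
    shap_values = (X_val - X_train.mean(axis=0)) * w
    return np.abs(shap_values).mean(axis=0)

def select_features(X, y, n_iterations=30, alpha=0.01, seed=0):
    """Hypothetical helper: keep features that beat a random feature."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    importances = np.empty((n_iterations, n_features + 1))
    for i in range(n_iterations):
        # Append a fresh, known-uninformative feature (assumed uniform here).
        noise = rng.uniform(-1.0, 1.0, size=(n_samples, 1))
        X_aug = np.hstack([X, noise])
        # Random 80/20 train/validation split for this iteration.
        idx = rng.permutation(n_samples)
        split = int(0.8 * n_samples)
        train, val = idx[:split], idx[split:]
        importances[i] = linear_shap_importance(X_aug[train], y[train], X_aug[val])
    random_imp = importances[:, -1]
    feature_mean = importances[:, :-1].mean(axis=0)
    # Empirical one-sided p-value: the fraction of iterations in which the
    # random feature's importance reaches the feature's mean importance.
    p_values = (random_imp[:, None] >= feature_mean[None, :]).mean(axis=0)
    return p_values < alpha, p_values

# Synthetic data: only the first two of five features drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=500)

selected, p_values = select_features(X, y)
print("selected:", selected)
print("p-values:", p_values)
```

In this sketch the two informative features have importances far above anything the random feature reaches, so their empirical p-values are zero. The three uninformative features compete with the random feature on roughly equal footing, so their p-values are usually, though not necessarily, above the threshold.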